
feat(ibis): introduce GCS file connector#1053

Merged
goldmedal merged 5 commits into Canner:main from goldmedal:feat/gcs
Feb 5, 2025

Conversation

@goldmedal
Contributor

@goldmedal goldmedal commented Feb 4, 2025

Description

  • Introduce the GCS File connector.
  • Usage is the same as for the local and S3 file connectors, but the connection info differs.

URL

/v2/connector/gcs_file
/v3/connector/gcs_file

Connection Info

{
        "url": "/tpch/data",
        "format": "parquet",
        "bucket": "gcs-bucket-name",
        "key_id": "hmackeyid",
        "secret_key": "hmacsecretkey",
        "credentials": "abcdef123456"
}
  • url: The root path of the dataset, excluding the bucket name.
  • format: The file format of the dataset.
  • bucket: The GCS bucket name.
  • key_id: The GCP HMAC key ID.
  • secret_key: The GCP HMAC secret key.
  • credentials: The GCP credentials.
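
As a rough sketch of how a client might assemble this connection info, the helper below builds the dictionary and rejects empty fields. The function name `build_gcs_connection_info` is hypothetical, and treating every field as required is an assumption; the PR description does not say which fields are optional.

```python
import json


def build_gcs_connection_info(url, file_format, bucket, key_id, secret_key, credentials):
    """Assemble connection info for the gcs_file connector (hypothetical helper)."""
    info = {
        "url": url,                  # root path of the dataset, without the bucket name
        "format": file_format,       # file format, e.g. "parquet"
        "bucket": bucket,            # GCS bucket name
        "key_id": key_id,            # HMAC key id
        "secret_key": secret_key,    # HMAC secret key
        "credentials": credentials,  # GCP credentials
    }
    # Assumption: every field is required; the PR description does not say otherwise.
    missing = [name for name, value in info.items() if not value]
    if missing:
        raise ValueError(f"missing connection info fields: {', '.join(missing)}")
    return info


connection_info = build_gcs_connection_info(
    url="/tpch/data",
    file_format="parquet",
    bucket="gcs-bucket-name",
    key_id="hmackeyid",
    secret_key="hmacsecretkey",
    credentials="abcdef123456",
)
print(json.dumps(connection_info, indent=2))
```

Building the dictionary in code rather than by hand avoids the missing or trailing commas that make a hand-written JSON body invalid.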

Summary by CodeRabbit


  • New Features

    • Introduced support for Google Cloud Storage file integration, allowing users to run queries and manage metadata for files stored in GCS seamlessly.
  • Tests

    • Added a comprehensive suite of tests to validate query functionality, connection handling, and metadata operations for GCS files.
  • Chores

    • Updated CI configurations and test markers to properly include and manage new GCS file scenarios.

@coderabbitai
Contributor

coderabbitai bot commented Feb 4, 2025

Walkthrough

This pull request introduces support for Google Cloud Storage (GCS) file handling. The changes add a new data source type (DataSource.gcs_file) and include corresponding DTOs and connection information classes. Updates are made in the rewriter, connector, metadata factories, and utility modules to accommodate GCS, and a new test suite with an associated pytest marker is provided. Overall, the PR extends existing functionality to manage GCS files without altering existing error handling or control flow.

Changes

| Files | Change Summary |
| --- | --- |
| ibis-server/app/mdl/rewriter.py | Updated _get_write_dialect in Rewriter to return "duckdb" when the data source is gcs_file. |
| ibis-server/app/model/{__init__.py, connector.py, data_source.py} | Introduced QueryGcsFileDTO and GcsFileConnectionInfo; updated DuckDBConnector to handle GCS; added gcs_file to the DataSource enum and linked it in the data source extension. |
| ibis-server/app/model/metadata/{factory.py, object_storage.py} | Added GcsFileMetadata class and updated metadata mapping to support GCS file metadata. |
| ibis-server/app/model/utils.py | Added init_duckdb_gcs function to initialize DuckDB with GCS secrets. |
| ibis-server/pyproject.toml | Added a new pytest marker: "gcs_file: mark a test as a gcs file test". |
| ibis-server/tests/routers/v2/connector/test_gcs_file.py | Added a comprehensive test suite for the GCS file connector covering queries, metadata retrieval, and error cases. |
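
The enum and dialect change summarized above might look roughly like the following. This is a simplified stand-in, not the code from the PR: the enum lists only the file sources named in this thread, the `DUCKDB_SOURCES` set and the free function `get_write_dialect` are hypothetical names, and the fallback for other sources is an assumption.

```python
from enum import Enum


class DataSource(str, Enum):
    # Only the file-based sources named in this PR thread are listed here.
    local_file = "local_file"
    minio_file = "minio_file"
    gcs_file = "gcs_file"


# Hypothetical set of sources served through DuckDB; the real set may differ.
DUCKDB_SOURCES = {DataSource.local_file, DataSource.minio_file, DataSource.gcs_file}


def get_write_dialect(data_source: DataSource) -> str:
    """Return the SQL dialect name the rewriter should target."""
    if data_source in DUCKDB_SOURCES:
        return "duckdb"
    # Assumption: any other source uses its own name as the dialect.
    return data_source.value


print(get_write_dialect(DataSource.gcs_file))
```

Keeping the file-based sources in one set means a new connector such as gcs_file only needs one extra member rather than a new branch.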

Sequence Diagram(s)

sequenceDiagram
    participant C as Client
    participant R as Rewriter
    participant D as DuckDBConnector
    participant U as Utils (init_duckdb_gcs)
    participant M as MetadataFactory

    C->>R: Submit query with GCS connection info
    R->>D: Determine write dialect ("duckdb") for DataSource.gcs_file
    D->>U: Call init_duckdb_gcs (with GcsFileConnectionInfo)
    U-->>D: Return initialized DuckDB connection
    D->>M: Request metadata handling for GCS
    M-->>D: Return GCS file metadata

Possibly related PRs

  • feat(ibis): introduce Local file connector #1029: That PR updates the same _get_write_dialect method to handle DataSource.local_file, just as this PR updates it to handle DataSource.gcs_file.
  • feat(ibis): introduce minio connector #1048: That PR updates the same _get_write_dialect method to include DataSource.minio_file, just as this PR updates it to include DataSource.gcs_file.

Suggested reviewers

  • onlyjackfrost
  • wwwy3y3

Poem

I’m a code-hopping rabbit, swift and free,
In grassy lines, GCS blooms for me.
New routes and paths in every byte,
DuckDB secrets shining bright.
With tests and markers in a playful spree,
Hoppin’ through code with joyful glee!
🐰💻


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74e53c9 and e8fadf2.

📒 Files selected for processing (1)
  • ibis-server/app/model/__init__.py (4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci
🔇 Additional comments (3)
ibis-server/app/model/__init__.py (3)

67-68: LGTM!

The QueryGcsFileDTO class follows the established pattern for file-based DTOs and correctly specifies its connection info type.


174-174: LGTM!

The description now correctly refers to "minio bucket" instead of "s3 bucket", improving accuracy.


187-196: LGTM!

The GcsFileConnectionInfo class correctly implements all required fields for GCS file connections, including both HMAC and service account authentication methods. The class is properly integrated into the ConnectionInfo union type.

Also applies to: 210-210


@github-actions github-actions bot added the ibis, dependencies, and python labels Feb 4, 2025
@github-actions github-actions bot added the ci label Feb 4, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
ibis-server/app/model/metadata/object_storage.py (1)

256-256: Fix incorrect logging message.

The logging message indicates "Initialized duckdb minio" but this is in the GCS metadata class.

-        logger.debug("Initialized duckdb minio")
+        logger.debug("Initialized duckdb gcs")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe58251 and 45fd509.

📒 Files selected for processing (9)
  • ibis-server/app/mdl/rewriter.py (1 hunks)
  • ibis-server/app/model/__init__.py (3 hunks)
  • ibis-server/app/model/connector.py (3 hunks)
  • ibis-server/app/model/data_source.py (3 hunks)
  • ibis-server/app/model/metadata/factory.py (2 hunks)
  • ibis-server/app/model/metadata/object_storage.py (3 hunks)
  • ibis-server/app/model/utils.py (2 hunks)
  • ibis-server/pyproject.toml (1 hunks)
  • ibis-server/tests/routers/v2/connector/test_gcs_file.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci
🔇 Additional comments (8)
ibis-server/app/model/metadata/factory.py (1)

9-9: LGTM!

The addition of GcsFileMetadata follows the established pattern for file metadata classes, maintaining consistency with existing implementations.

Also applies to: 30-30

ibis-server/app/model/utils.py (1)

53-68: Verify GCS credentials usage.

The function creates a GCS secret but doesn't utilize the credentials field from GcsFileConnectionInfo. This might be required for certain GCS authentication scenarios.

Please verify if the credentials field should be used in the secret creation by checking DuckDB's GCS integration documentation:

✅ Verification successful

GCS HMAC Authentication is Configured Correctly

  • The current implementation follows DuckDB’s documented HMAC authentication method for GCS by creating a secret using KEY_ID and SECRET.
  • DuckDB’s configuration for GCS (via CREATE SECRET) does not require the additional credentials field when using HMAC keys.
  • If you intend to support alternative authentication methods (e.g., using local gcloud CLI credentials), a different approach would be needed.
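
To make the HMAC approach concrete, here is a minimal sketch of building the DuckDB `CREATE SECRET` statement from a key ID and secret. The helper name `build_gcs_secret_sql` is hypothetical, and whether the PR's `init_duckdb_gcs` builds its statement this way is not shown in this thread.

```python
def build_gcs_secret_sql(key_id: str, secret_key: str) -> str:
    """Build a DuckDB CREATE SECRET statement for GCS HMAC keys (hypothetical helper)."""

    def quote(value: str) -> str:
        # Double any single quotes so the value is a valid SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    return (
        "CREATE SECRET (TYPE gcs, "
        f"KEY_ID {quote(key_id)}, "
        f"SECRET {quote(secret_key)})"
    )


sql = build_gcs_secret_sql("hmackeyid", "hmacsecretkey")
print(sql)
```

The resulting statement would then be run on a DuckDB connection, for example with `connection.execute(sql)`, before querying files in the bucket.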
🏁 Scripts executed

The following scripts were executed for the analysis:

Web query:

How to configure GCS authentication in DuckDB?

Length of output: 5528

ibis-server/app/mdl/rewriter.py (1)

79-79: LGTM!

The addition of DataSource.gcs_file to the DuckDB dialect set follows the established pattern for file-based data sources.

ibis-server/app/model/__init__.py (1)

67-68: LGTM!

The QueryGcsFileDTO class follows the established pattern for file-based query DTOs.

ibis-server/app/model/data_source.py (1)

25-25: LGTM! The GCS file data source is properly integrated.

The changes follow the established pattern for adding new data sources, maintaining consistency with the existing codebase.

Also applies to: 52-52, 79-79

ibis-server/app/model/connector.py (1)

21-21: LGTM! The GCS connector is properly integrated.

The changes follow the established pattern for adding new connectors, maintaining consistency with the existing error handling and initialization patterns.

Also applies to: 28-28, 46-46, 172-173

ibis-server/tests/routers/v2/connector/test_gcs_file.py (1)

1-508: LGTM! Comprehensive test coverage for the GCS connector.

The test suite is well-structured and covers all essential aspects:

  • Basic query functionality
  • Query limits and calculated fields
  • Error handling and edge cases
  • Metadata operations
  • Support for different file formats (parquet, csv, json)
  • Type mapping verification

Good use of environment variables for sensitive data and pytest fixtures for test setup.

ibis-server/pyproject.toml (1)

70-70: GCS Test Marker Addition: Confirm and Document

The new marker
"gcs_file: mark a test as a gcs file test",
has been correctly introduced in the [tool.pytest.ini_options] section. This addition meets the PR objective of supporting tests related to the new GCS file connector.

Please ensure that any related documentation or guidelines for writing tests include details about this marker so that developers know when and how to use it.

@goldmedal goldmedal requested a review from wwwy3y3 February 4, 2025 12:22
@goldmedal goldmedal merged commit 50d1240 into Canner:main Feb 5, 2025
@goldmedal goldmedal deleted the feat/gcs branch February 5, 2025 02:13